Superstore Sales Analysis and Forecasting¶
Project Overview¶
This project focuses on analyzing sales data from a retail superstore and forecasting future sales using advanced data analysis and time series forecasting techniques. The goal is to explore the key drivers of sales, identify trends and seasonality, and develop a predictive model to forecast future sales performance.
Key Features¶
- Exploratory Data Analysis (EDA): Conduct a detailed investigation of the dataset, including identifying top-selling products, sales by category, and sales trends over time.
- Data Cleaning: Handle missing values, convert date columns, and derive new columns for better analysis.
- Time Series Analysis: Decompose the sales data into trend, seasonal, and residual components.
- Forecasting: Use a SARIMAX model to forecast future sales.
Tools and Technologies Used¶
- Python: Primary programming language used for data manipulation and analysis.
- Pandas: Used for data manipulation and analysis.
- Matplotlib and Seaborn: For data visualization.
- Plotly: For interactive charts and graphs.
- Statsmodels: For time series analysis and forecasting using SARIMAX.
- SQLite: Used for storing cleaned data and performing SQL queries.
Dataset¶
The dataset used for this analysis is "Superstore Sales.csv", which contains sales records for a retail superstore. Among its 18 columns are the following:
- Order Date: Date when the order was placed.
- Ship Date: Date when the order was shipped.
- Product Name: Name of the product sold.
- Sales: Revenue generated from the sale.
- Category: Product category.
- Postal Code: Customer's postal code.
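A quick schema check after loading can catch column-name drift early. A minimal sketch (the `missing_columns` helper and the toy frame are illustrative, not part of the notebook; the required set mirrors the fields listed above):

```python
import pandas as pd

# Columns the analysis below depends on (subset of the dataset's 18 columns)
REQUIRED_COLUMNS = {"Order Date", "Ship Date", "Product Name", "Sales", "Category", "Postal Code"}

def missing_columns(df: pd.DataFrame) -> set:
    """Return the required columns that are absent from the DataFrame."""
    return REQUIRED_COLUMNS - set(df.columns)

# Toy frame standing in for the loaded CSV
toy = pd.DataFrame(columns=sorted(REQUIRED_COLUMNS) + ["Region"])
assert missing_columns(toy) == set()  # all required columns present
```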
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
import sqlite3
In [3]:
def execute_query(conn, query) -> pd.DataFrame:
    """Execute a SQL query and return the result as a DataFrame."""
    try:
        return pd.read_sql_query(query, conn)
    except Exception as e:
        print(f"Error executing query: {e}")
        return pd.DataFrame()
# Load the data
try:
    sales_data = pd.read_csv('Superstore Sales.csv')
    print(f"Data loaded successfully with {sales_data.shape[0]} rows and {sales_data.shape[1]} columns.")
except FileNotFoundError:
    print("Error: The dataset file 'Superstore Sales.csv' was not found.")
    exit()
# Display the first few rows
print("Initial Dataset Overview:")
print(sales_data.head())
# Create SQL database
conn = sqlite3.connect('SalesData.db')
sales_data.to_sql('Sales_Data', conn, if_exists='replace', index=False, chunksize=500)
Data loaded successfully with 9800 rows and 18 columns.
Initial Dataset Overview:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID \
0 1 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520
1 2 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520
2 3 CA-2017-138688 12/06/2017 16/06/2017 Second Class DV-13045
3 4 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335
4 5 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335
Customer Name Segment Country City State \
0 Claire Gute Consumer United States Henderson Kentucky
1 Claire Gute Consumer United States Henderson Kentucky
2 Darrin Van Huff Corporate United States Los Angeles California
3 Sean O'Donnell Consumer United States Fort Lauderdale Florida
4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
Postal Code Region Product ID Category Sub-Category \
0 42420.0 South FUR-BO-10001798 Furniture Bookcases
1 42420.0 South FUR-CH-10000454 Furniture Chairs
2 90036.0 West OFF-LA-10000240 Office Supplies Labels
3 33311.0 South FUR-TA-10000577 Furniture Tables
4 33311.0 South OFF-ST-10000760 Office Supplies Storage
Product Name Sales
0 Bush Somerset Collection Bookcase 261.9600
1 Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400
2 Self-Adhesive Address Labels for Typewriters b... 14.6200
3 Bretford CR4500 Series Slim Rectangular Table 957.5775
4 Eldon Fold 'N Roll Cart System 22.3680
Out[3]:
9800
In [4]:
# Check and handle missing data
print("\nMissing Values Per Column:")
print(sales_data.isnull().sum())
sales_data = sales_data.dropna(subset=['Postal Code'])
Missing Values Per Column:
Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code     11
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
dtype: int64
In [5]:
# Change the format to datetime
sales_data['Order Date'] = pd.to_datetime(sales_data['Order Date'], format='%d/%m/%Y', errors='coerce')
sales_data['Ship Date'] = pd.to_datetime(sales_data['Ship Date'], format='%d/%m/%Y', errors='coerce')
# Add derived columns
sales_data['Month'] = sales_data['Order Date'].dt.to_period('M').astype(str)
sales_data['Year'] = sales_data['Order Date'].dt.year.astype(str)
# Save the cleaned dataset to database
sales_data.to_sql('superstore_cleaned', conn, if_exists='replace', index=False)
Out[5]:
9789
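Because `errors='coerce'` silently turns unparseable dates into `NaT` rather than raising, a quick count after conversion is cheap insurance. A minimal sketch (the toy series is illustrative; the day-first format string mirrors the cell above):

```python
import pandas as pd

# errors='coerce' maps unparseable strings to NaT instead of raising,
# so count NaT values to confirm the dd/mm/yyyy format matched every row
raw = pd.Series(["08/11/2017", "12/06/2017", "not a date"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
n_bad = parsed.isna().sum()  # the malformed entry becomes NaT
```

Applied to the real data, a nonzero count would signal rows whose `Order Date` or `Ship Date` did not match the expected format.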
In [6]:
# Exploratory Data Analysis (EDA)
print("\nExploratory Data Analysis:")
# Identify top 10 selling products by revenue
top_product_query = """
SELECT
"Product Name",
SUM(Sales) as "Total Sales"
FROM
superstore_cleaned
GROUP BY
"Product Name"
ORDER BY
"Total Sales" DESC
LIMIT 10;
"""
top_products = execute_query(conn, top_product_query)
fig = px.pie(
top_products,
names='Product Name',
values='Total Sales',
title='Top 10 Products by Revenue',
labels={'Total Sales': 'Revenue'}
)
fig.show()
# Revenue by product category
top_category_query = """
SELECT
"Category",
SUM(Sales) as "Revenue"
FROM
superstore_cleaned
GROUP BY
"Category"
ORDER BY
"Revenue" DESC;
"""
top_category = execute_query(conn, top_category_query)
fig = px.pie(
top_category,
names='Category',
values='Revenue',
title='Revenue Distribution by Category'
)
fig.show()
# Calculate monthly sales for trend analysis
monthly_sales_query = """
SELECT
strftime('%Y-%m', "Order Date") as Month,
SUM(Sales) AS Sales
FROM
superstore_cleaned
GROUP BY
Month
ORDER BY
Month;
"""
monthly_sales = execute_query(conn, monthly_sales_query)
monthly_sales['Month'] = pd.to_datetime(monthly_sales['Month'])
monthly_sales.set_index('Month', inplace=True)
fig = px.line(
monthly_sales,
x=monthly_sales.index,
y='Sales',
title='Monthly Sales Trend',
labels={'index': 'Date', 'Sales': 'Monthly Sales'}
)
fig.show()
conn.close()
Exploratory Data Analysis:
In [7]:
# Decompose the sales data into trend, seasonal, and residual components
decomposition = seasonal_decompose(monthly_sales['Sales'], model='additive', period=12)
decomposition.plot()
plt.show()
In [10]:
# Define SARIMA model
sarimax_model = SARIMAX(
monthly_sales['Sales'],
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12)
)
# Fit the model
sarimax_fit = sarimax_model.fit()
# Forecast
sarimax_forecast = sarimax_fit.forecast(steps=12)
# Plot the historical sales and forecast
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['Sales'], label='Historical Sales')
plt.plot(sarimax_forecast, label='Forecasted Sales', color='red')
plt.legend()
plt.title('Sales Forecast')
plt.show()
C:\Users\jonat\anaconda3\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
C:\Users\jonat\anaconda3\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
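The ValueWarning above can be avoided by giving the monthly index an explicit frequency before fitting, so statsmodels does not have to infer it. A minimal sketch (assuming the index holds contiguous month-start dates, as the SQL grouping produces):

```python
import pandas as pd

# asfreq("MS") attaches an explicit month-start frequency to the index,
# which silences statsmodels' "No frequency information" ValueWarning
idx = pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"])
s = pd.Series([1.0, 2.0, 3.0], index=idx).asfreq("MS")
```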